It’s common to hear about species becoming extinct or endangered. Advocacy organizations protecting nature frequently appear on television promoting their cause. But how often do you hear about endangered languages?
It is estimated that one language goes extinct every three months. The problem of language preservation is very urgent.
What does PanLex do?
PanLex supports multilingual translation. The project has built the world’s largest translation database. Currently it covers 2,500 dictionaries, 5,700 languages and 25,000,000 words. By transforming thousands of dictionaries into a single common structure, the PanLex database makes it possible to derive billions of lexical, or word-to-word, translations that are not found in any single dictionary.
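To make the idea of deriving translations concrete, here is a minimal sketch of combining two small dictionaries through a shared word. It is an illustration only, not a description of how the PanLex database works internally: the sample data, the `(language_code, word)` representation, and the function names `build_graph` and `derive` are all assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical sample data: each dictionary is a list of translation pairs,
# and each side of a pair is a (language_code, word) tuple.
ENG_SPA = [(("eng", "kidney"), ("spa", "riñón"))]
ENG_URD = [(("eng", "kidney"), ("urd", "گردہ"))]


def build_graph(*dictionaries):
    """Merge several dictionaries into one undirected translation graph."""
    graph = defaultdict(set)
    for dictionary in dictionaries:
        for left, right in dictionary:
            graph[left].add(right)
            graph[right].add(left)
    return graph


def derive(graph, source, target_lang):
    """Return translations of `source` in `target_lang`.

    Direct translations are preferred; otherwise the function goes
    through one intermediate (pivot) word shared by two dictionaries.
    """
    neighbours = graph.get(source, set())
    direct = {word for lang, word in neighbours if lang == target_lang}
    if direct:
        return direct
    derived = set()
    for pivot in neighbours:
        for lang, word in graph.get(pivot, set()):
            if lang == target_lang:
                derived.add(word)
    return derived


graph = build_graph(ENG_SPA, ENG_URD)
print(derive(graph, ("spa", "riñón"), "urd"))
```

Neither sample dictionary contains a Spanish-Urdu pair, yet the combined graph yields one through the shared English word. A one-step pivot like this can produce wrong matches when the pivot word has several meanings, so a real system would need more evidence than a single shared word.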
Machine translation apps like Google Translate translate whole sentences and texts in up to a hundred major world languages. PanLex, by contrast, translates individual words, but in thousands of languages.
What have I been doing for the project?
My main focus is adding new lexical translations to the database. The process of “adding” data from a dictionary to our database might seem simple. Say there’s a dictionary in paper format. Our partners digitize it and return it to us. Now we have to write code that extracts particular information from the dictionary and incorporates it into the database. Seems easy, right? Well, for me this is the most challenging part, since I don’t have a strong computer science background, and it has been a steep learning curve!
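To give a flavor of what such an extraction step can look like, here is a minimal sketch that parses one line of a digitized dictionary. The entry layout (`headword (part of speech) translation; translation`), the `ENTRY` pattern, and the `parse_entry` function are hypothetical and chosen only for illustration. Real source dictionaries each have their own layout and need their own rules.

```python
import re

# Hypothetical entry layout: "headword (part of speech) translation; translation"
ENTRY = re.compile(
    r"^(?P<headword>[^()]+?)\s*\((?P<pos>[^)]+)\)\s*(?P<translations>.+)$"
)


def parse_entry(line):
    """Split one dictionary line into headword, part of speech, translations.

    Returns None when the line does not follow the assumed layout.
    """
    match = ENTRY.match(line.strip())
    if match is None:
        return None
    translations = [
        part.strip() for part in match["translations"].split(";") if part.strip()
    ]
    return {
        "headword": match["headword"].strip(),
        "pos": match["pos"].strip(),
        "translations": translations,
    }


print(parse_entry("kidney (n) riñón"))
```

Lines that do not match the assumed layout return `None`, which makes it easy to collect them for manual review instead of silently loading bad data.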
Another focus of my internship is creating transliterators. Transliteration allows the reader to sound out a word from an unfamiliar language or script by seeing it transformed into a familiar script. My part in this process is to convert words written in Cyrillic or Arabic script into IPA (the International Phonetic Alphabet) for several languages. These transliterations will support a new feature of the PanLex Translator app. Say you need to know the Urdu word for “kidney”. You type “kidney” into the translator and get “گردہ”, which looks like gibberish if you can’t read the script. All you know is that these characters mean “kidney” in Urdu. This is where the transliterator comes in: it tells you that the word is pronounced “gurda”.
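At its simplest, a transliterator is a table of replacements applied from left to right. The sketch below shows that idea with a toy Cyrillic table. The table `CYRILLIC_TO_IPA` and the function `transliterate` are assumptions for illustration, not PanLex’s actual rules, and the mapping is not an accurate phonology of any specific language.

```python
# Hypothetical toy table: a few Cyrillic letters and one digraph mapped to IPA.
CYRILLIC_TO_IPA = {
    "дж": "dʒ",
    "м": "m",
    "а": "a",
    "т": "t",
    "к": "k",
    "о": "o",
    "ш": "ʃ",
}


def transliterate(word, table=CYRILLIC_TO_IPA):
    """Replace each known letter sequence with its IPA symbol.

    Longer keys are tried first so that digraphs win over single letters.
    Characters missing from the table are copied through unchanged.
    """
    keys = sorted(table, key=len, reverse=True)
    word = word.lower()
    output = []
    position = 0
    while position < len(word):
        for key in keys:
            if word.startswith(key, position):
                output.append(table[key])
                position += len(key)
                break
        else:
            # No table entry matched at this position.
            output.append(word[position])
            position += 1
    return "".join(output)


print(transliterate("мама"))
```

A plain lookup table like this is not enough for a word such as “گردہ”, because Urdu spelling usually leaves short vowels unwritten. Getting from the written form to “gurda” takes language-specific rules or extra lexical information on top of the letter mapping.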
“Don’t forget, you’re in the Bay Area”
One fun part of my internship is the location! Our office is in Berkeley, California, which means that there’s always a lot of cool stuff going on. I was lucky to be introduced to IMUG (the International Multilingual User Group), a forum for language technology professionals that meets monthly in Silicon Valley. The meetings I attended were at Netflix and Facebook. Such meetings are a good opportunity to learn about trends in the industry and, of course, to meet like-minded people.
Because I’m eager to explore as much as I can in the Bay Area, I couldn’t pass up the chance to fulfill my long-time dream of visiting the Google headquarters.
Living in Berkeley can also surprise you by bringing you face to face with people whose names you’ve only seen in scientific articles. I live on the same street as the famous linguist Leonard Talmy, whose works I read as a student. I never even thought of meeting him in person. But here I am, having coffee with him at the local cafe and talking about the weirdness of languages.
See you later, California!
Thanks to this internship, I gained a new set of skills, enriched my knowledge of the world’s languages and writing systems, expanded my network, and got an insight into the tech world of Silicon Valley. So, there are a couple of things I’d like to say.
Be open to all opportunities, because you never know what waits for you around the corner. Be brave. Face your fears and try to overcome them. That’s how you grow.
And stay curious. Curiosity is the driving force of personal development.