Computing Science

Research Profile: Computational Linguistics and Natural Language Processing

(Human) language can be hard to quantify in rules… There are so many irregularities. Machine learning can capture the exceptions for you. It’s powerful, powerful stuff.

- Susan Bartlett, M.Sc. Computing Science, University of Alberta

David Beck, Linguistics professor, looking over the shoulder of colleague Greg Kondrak, Computing Science professor.

Our talent for communicating is one of the things that truly makes us a unique species.

We can translate our spoken language into written language. We can effortlessly invent new words like blog or bling, and we can easily understand these new words, even if we didn’t make them up ourselves. Our capacity to change language, and to understand the changes made by others, has resulted in thousands of distinct languages developing over the course of human history. Our languages reflect our thought patterns, our cultures, and our history as a species.

And now most of these languages are dying.

David Beck, a linguistics professor at the University of Alberta (U of A), says 60-80% of the world’s languages will be extinct by the end of this century.

“Most of the world’s languages are very small, spoken by a few thousand people in small communities, and those communities are under intense pressure from dominant national languages,” says Beck.

“I feel it’s a tragedy,” he says. “It means, from a scientific perspective, we lose all this data that tells about how people migrated, how people’s minds work, how people classify the world, and how societies evolved. And for the speakers (of dying languages) it means they lose their culture, they lose their identity, and they lose their connection with the past in fundamental ways.”

Beck is documenting and studying Upper Necaxa Totonac, a Mexican language on the verge of extinction. His analysis of the language has been made much easier with help from a colleague in the U of A Department of Computing Science, professor Greg Kondrak.

Kondrak has developed a computer program that identifies cognates. Cognates are words from different languages that probably share a common origin, like night (English), nuit (French), and nacht (German), which all likely evolved from the original form nokt.

“If you look at any two languages that are related to each other, the theory is that at some point they were the same language, back in history, and then as time progressed, populations separated, and each population developed its own standard speech,” Beck says.

Theoretically, it’s possible to trace cognates back to the very beginning of human language. And knowing the full story of how human language developed would also reveal valuable information about how the human species colonized Earth and where we originally evolved.

To find out how the Upper Necaxa Totonac language fits into all of this, Beck compares the language with related languages and looks for cognates. Kondrak’s program helps him by analyzing lists of words and singling out possible cognates.
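To make the idea of "singling out possible cognates" concrete, here is a minimal sketch in Python. It is not Kondrak's program, whose internals the article does not describe. It assumes a much cruder stand-in: score every pair of words by normalized edit distance on their spelling and keep the pairs above a cutoff. The function names and the 0.5 threshold are hypothetical choices made for illustration.

```python
from itertools import product


def edit_distance(a, b):
    """Levenshtein distance: the fewest single-character insertions,
    deletions, or substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Map edit distance to a 0..1 score, where 1.0 means identical spelling."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a.lower(), b.lower()) / longest


def candidate_cognates(words_a, words_b, threshold=0.5):
    """Return (score, word_a, word_b) tuples at or above the threshold,
    best first. The threshold is an illustrative assumption."""
    scored = [(similarity(a, b), a, b) for a, b in product(words_a, words_b)]
    return sorted((s for s in scored if s[0] >= threshold), reverse=True)


if __name__ == "__main__":
    for score, a, b in candidate_cognates(["night"], ["nacht", "nuit"]):
        print(f"{a} ~ {b}: {score:.2f}")
```

With these toy inputs, night and nacht score 0.6 and clear the cutoff, while night and nuit score only 0.4 and do not, even though both pairs are cognates. That gap suggests why a serious cognate finder needs more than raw spelling comparison.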

“I had already identified a number of cognates… But Greg’s program confirmed some of those, so we know it’s working,” Beck says. Kondrak’s program has even identified cognates Beck didn’t know about.

“And (his program) can do it much faster than I can,” he says, adding that comparing two languages strictly by hand would take years rather than the minutes Kondrak’s program needs. “Quite honestly, if I had to do this by hand and all by myself, I probably wouldn’t.”

In addition to helping piece together the history of human language, Kondrak’s work is being used by the U.S. Food and Drug Administration (FDA) to cut down on mistakes involving medical prescriptions.

Doctors and pharmacists occasionally give wrong prescriptions because they confuse drugs with similar-sounding names: Levoxine and Lanoxin, for example. Mistakes like these have caused people to die.

To prevent such mistakes, the FDA rejects new drug names that are too similar to existing drug names. The FDA used to accomplish this by manually comparing lists of new and old drug names. But with hundreds of new drugs each year joining the thousands already on the market, the task was becoming too onerous.

The job is now done by a computer program that uses two algorithms written by Kondrak. The program is more than 90% accurate and does the job much faster than humans.
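The article does not say which two algorithms the program uses, so the following Python sketch should be read only as an illustration of the general idea of screening a proposed name against existing ones. It assumes a simple, well-known measure, the Dice coefficient over character bigrams. The function names and the 0.4 cutoff are hypothetical.

```python
from collections import Counter


def bigrams(name):
    """Split a name into overlapping two-letter chunks, ignoring case."""
    s = name.lower()
    return [s[i:i + 2] for i in range(len(s) - 1)]


def dice_similarity(a, b):
    """Dice coefficient: twice the number of shared bigrams divided by
    the total number of bigrams in both names (0.0 to 1.0)."""
    ba, bb = Counter(bigrams(a)), Counter(bigrams(b))
    total = sum(ba.values()) + sum(bb.values())
    if total == 0:
        # Names too short to have bigrams: fall back to exact comparison.
        return 1.0 if a.lower() == b.lower() else 0.0
    shared = sum((ba & bb).values())  # multiset intersection
    return 2 * shared / total


def flag_conflicts(new_name, existing_names, threshold=0.4):
    """List existing names whose score against new_name meets the cutoff.
    The cutoff is an illustrative assumption, not a regulatory value."""
    return [n for n in existing_names
            if dice_similarity(new_name, n) >= threshold]


if __name__ == "__main__":
    print(flag_conflicts("Levoxine", ["Lanoxin", "Aspirin"]))
```

Here Levoxine and Lanoxin share three bigrams (ox, xi, in) out of thirteen in total, for a score of 6/13, or about 0.46, so Lanoxin is flagged. Aspirin shares only one bigram with Levoxine and is not. Spelling overlap is just one signal. Names can also sound alike while being spelled differently, which a sketch like this would miss.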

Kondrak, a fluent speaker of three languages, says he is pleased that he has been able to merge his interest in language with his expertise in computing science.

Using computers to analyze human language is a branch of computing science called natural language processing (NLP).

The research of Kondrak’s students shows that there are plenty of directions to explore in NLP. Master’s student Susan Bartlett is using machine learning to automatically divide words into syllables. Another master’s student, Tarek Sherif, is using machine learning to automatically translate proper names, such as the names of people or cities, from Arabic to English.

“The advantage that computers have in studying language is that they can process large amounts of data very quickly,” says Kondrak. “That’s where computers can be very helpful.”

So it turns out that a new human invention, computers, can help us penetrate the complexities of a far older human invention. And maybe, with the help of computers, we’ll be able to solve language-related problems and discover things about our languages, and ourselves, that we wouldn’t have been able to otherwise.

Article and photos by Erin Ottosen, 2007.