Putting the English Language on ICE

03 June 2009

To the untrained eye scanning Caedmon's Hymn, a nine-line seventh-century poem considered the oldest surviving English text, any similarities between the monk's words and the 21st-century version of the language might be a little hard to come by.

The evolution of the English language, now the world's default language of communication, is a beast that never sleeps, and it has driven far-reaching cultural shifts around the world.

Watching over this transformation is the International Corpus of English, or ICE, a group of language researchers who came together in 1990 with the primary aim of collecting material for online comparative studies of national or regional varieties of English across the globe.

Under the direction of John Newman, professor and chair of the University of Alberta's Department of Linguistics, preparation of the million-word Canadian component of this large collection, ICE-Canada, has just been completed.

"We're trying to create a documentation of English and all its varieties in all the different countries at, more or less, one point in time, which, in this case, is the end of the 20th century," said Newman. "The hope is to correct the imbalance and attention we give different varieties of English. For instance, if you look at American English or British English published in newspapers, it doesn't give any sense of the variety around the world.

"With a completed corpus, researchers will be able to look at such things as syntax, vocabulary and, possibly, cultural differences."

Newman explains that this corpus isn't a dictionary of words in isolation but, rather, is connected to speech and texts from the more than 20 participating countries where English has a history.

"I can use Google and get you 10 million words of English in 20 minutes, but that is not genuine spoken language," he said. "As a written language, you don't know whether the people writing the language on the Internet are speakers of English as a first or second language, whether they're young, old, male, female, or where they learnt the language, so you are mixing up genres. For this project we document all that metadata about our speakers and writers."

Newman says each participating country builds its particular corpus by following a common design, as well as a common scheme for grammatical annotation, to ensure compatibility. A team of linguistics graduate students at the U of A was given the time-consuming task of transcribing taped conversations, being sure to document the demographic information about each individual speaker, as well as conversation indicators like the length of pauses or when speakers spoke at the same time.

All told, each collection in ICE will document more than 300 situations where English is spoken or written. Spoken samples range from private conversations in person or over the telephone to classroom and legal speech, while writing samples include everything from personal letters to completed exams.

"This sort of information is typically very difficult to study; it is not that easy to get access to that kind of language usage," said Newman.

This U of A-led project is actually a continuation of the ICE-Canada project begun in the early 1990s at Concordia University. Montreal-based researchers collected hundreds of conversations, but the project fizzled before the information was properly catalogued.

"In 2002, when I came to the university, I began to inquire about the status of all that data," said Newman. "Nothing more was happening with it at Concordia, so we took over all the responsibility for it."

It has been three hard years, but Newman's team has recently typed in the last of the one million words.

And while versions of Canadian English vary from region to region, Newman says this corpus contains a fair cross-section of English spoken in Canada and provides a nice slice in time for comparative language researchers to start picking through the data.

"Language is continually changing so, ideally, if you want to track the evolution of English you would want to put together a corpus every 50 years or so," he said, adding that although many of the corpora are still being developed, early glances have already begun to bear fruit. "We already know that Canada and New Zealand have the highest usage of the word 'snow'. The Philippines and Jamaica have highest instance of 'rain'."

Other countries involved in the project that are at various stages of completion include Australia, Fiji, Great Britain, Hong Kong, India, Ireland, Malaysia, Singapore, South Africa, Sri Lanka, Trinidad and Tobago, Kenya, Malawi, Tanzania, Nigeria, Pakistan, Sierra Leone, Malta and the United States.