No single path - Finnish lessons in the commercialisation of language engineering research

By Antti Arppe

The Finnish language industry has strong academic roots - a majority of the Finnish IT companies that are primarily involved in creating and providing language engineering software products can trace their origins to individual researchers or research groups at Finnish universities. But that is where the similarities end. At least in the case of 'older' companies founded in the 1980s that have had time to accumulate some corporate history, the paths and strategies from academic start-ups to commercially functioning corporations have varied substantially. With time, some of these companies have found clear, profitable niches for themselves, but for others the quest still continues. In retrospect, it seems that those that have managed to see themselves as commercial IT companies first and language engineering companies second have succeeded best. Nevertheless, a major international breakthrough for a Finnish language engineering company is still awaited.

Academic origins

Some say that having Finnish as one's mother tongue is a gift for a linguist - or a computational linguist for that matter - as its structure with myriads of inflected, derived and compound forms provides a refreshingly different perspective on language as compared to English, the language that has dominated language engineering as long as the field has existed. Whereas in English one can in principle create a prototypical language engineering tool such as a simple spell-checker by merely listing and compressing the most common 100,000 words or so, in Finnish one would need to list tens if not hundreds of millions of word forms to create a speller with comparable coverage using the same technique.
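The scale of the listing problem can be illustrated with a minimal sketch. It assumes a single, conveniently regular noun stem ('talo', house) and a small hand-picked subset of endings that can simply be concatenated; the suffix lists and the function name are illustrative choices, not a linguistic model of Finnish and not the method of any company discussed here. Real Finnish has more cases and clitics, as well as stem alternations that plain concatenation cannot handle.

```python
from itertools import product

# Toy inventory: a deliberately small subset of Finnish noun endings,
# chosen so that plain concatenation yields well-formed words for this stem.
STEM = "talo"                                       # 'house'
NUMBER = ["", "i"]                                  # singular, plural
CASES = ["ssa", "sta", "lla", "lta", "lle"]         # 5 of the locative cases
POSSESSIVES = ["", "ni", "si", "mme", "nne", "nsa"]  # none + 5 possessive suffixes
CLITICS = ["", "kin", "ko"]                         # none, 'also', question marker


def enumerate_forms(stem):
    """Return every stem + number + case + possessive + clitic combination."""
    slots = product(NUMBER, CASES, POSSESSIVES, CLITICS)
    return ["".join((stem,) + combo) for combo in slots]


forms = enumerate_forms(STEM)
print(len(forms))   # 2 * 5 * 6 * 3 = 180 forms from one stem
print(forms[:4])    # the first few generated forms
```

Even this restricted inventory yields 180 distinct forms for one stem, and the count multiplies with every further case, clitic, derivation or compound part that is allowed. This is why a plain word list stops being practical for Finnish and a rule-based morphological model becomes attractive.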

Theories and solutions that might work for English typically don't for Finnish, and it is no surprise that Finnish researchers in the late 1970s were motivated to find a solution that could be used to model Finnish, so that computer programs could efficiently and practically be able to process the language. At the forefront of these pioneers were Kimmo Koskenniemi and Fred Karlsson at the University of Helsinki, and Harri Arnola at the Helsinki University of Technology.

In his doctoral dissertation in 1983, Kimmo Koskenniemi presented the so-called Two-level model (TWOL), which could be used to model, in principle, the morphology of any natural language, and a practical implementation of this model for Finnish. Though it took some years for the international academic community to finally digest and accept Koskenniemi's innovation, in Finland it was received with enthusiasm by the national IT industry.

This led to a succession of co-operation projects between the Department of General Linguistics, headed by Koskenniemi's colleague and collaborator Fred Karlsson, and corporations such as Nokia and IBM Finland, and finally prompted the founding of Lingsoft Inc. in 1986. On his own behalf, Karlsson developed a TWOL description for Swedish in the late 1980s, and proposed in 1990 another new language engineering innovation, the Constraint Grammar formalism (CG).

Both Karlsson and Koskenniemi have advocated advancing in language engineering step-wise, by comprehensively covering a simpler level of language such as morphology before proceeding to a higher, more complex level such as syntax or semantics. Harri Arnola, a researcher and lecturer in artificial intelligence at the Helsinki University of Technology, shared this view, and he focused on undertaking such a development process for Finnish, encompassing comprehensively both a morphological and syntactic model. Financed by SITRA, the Finnish National Fund for Research and Development, Arnola started in 1982 the Kielikone project (literally 'language machine'), which aimed at creating a Finnish natural language interface for querying databases.

In the late 1980s, however, this original goal was dropped in favour of machine translation from Finnish to English at the suggestion of (then) Nokia Telecommunications. By this time, the project had already produced its own models for Finnish morphology and syntax, which led to the founding of Kielikone Ltd. in 1987 in order to commercialise these spin-off technologies.

These researchers were soon followed by their students, some of whom had already participated in the implementation of the pioneering projects. At the University of Helsinki, Olli Blåberg, a student of Koskenniemi and Karlsson, was in 1987 the first to start a company of his own, Lanser Data, in order to develop a machine translation system from Finnish to Swedish for weather forecasts.

Researchers Lauri Carlson and Krister Lindén co-operated in 1988-1992 with Seppo Koskenniemi, Kimmo's elder brother working for IBM in Finland, to develop a machine translation system from English to Finnish. This system, dubbed MENTOR/F, was, however, apparently never productised. One outcome of this project was nevertheless that Lindén was picked by Kimmo Koskenniemi to head Lingsoft as its first permanent full-time employee in 1992, when Koskenniemi himself received a permanent chair at the University of Helsinki. In addition to machine translation, Seppo Koskenniemi has been involved since the 1970s in developing basic linguistic descriptions at IBM Dictionary and Linguistic Tools for a variety of morphologically rich languages, including Finnish, Turkish, Thai, Hungarian, Czech and Polish, which have been commercially exploited by the company in its information retrieval products.

Arto Anttila, Juha Heikkilä, Timo Järvinen, Pasi Tapanainen and Atro Voutilainen participated first in the practical implementation of Fred Karlsson's Constraint Grammar formalism for English, which was distributed by Lingsoft. In the late 1990s, three of these researchers, Voutilainen, Tapanainen and Järvinen, continued their co-operation to develop a new grammar formalism, the Functional Dependency Grammar (FDG), which in some respects resembles the earlier Constraint Grammar but is expressively more powerful. This they decided to commercialise at their own risk, founding Conexor in 1997.

Among the many researchers that participated in the Kielikone project, one should probably mention Timo Honkela and Aarno Lehtola. Honkela turned his attention in the 1990s to Self-Organising Maps (SOM), originally presented by Teuvo Kohonen. In his doctoral thesis, Honkela explored further the use of SOM in natural language processing, and together with Kohonen, Samuel Kaski and Krista Lagus, developed in the WEBSOM project SOM-based methods and tools for visual information retrieval. This research led to the founding of Gurusoft in 1997, which aims at providing solutions in knowledge management of textual information. Aarno Lehtola, for his part, co-operated for several years with Blåberg at Lanser Data, after which he moved over to VTT, the Technical Research Centre of Finland, where he has participated in or managed a series of language engineering projects, concerning eBusiness, for instance.

Not all Finnish companies involved in some type of language engineering are spin-offs of academic research. TimeHouse, which has carried out a variety of software projects since 1985, created a speech synthesiser for Finnish, MikroPuhe, as early as 1991 in co-operation with the Laboratory of Acoustics and Audio Signal Processing at Helsinki University of Technology. The company has also developed a platform for electronic publishing, THText, which WSOY, one of Finland's foremost publishers, has used in its electronic dictionaries. Promentor Solutions, founded in 1987, has a long and successful history in developing and marketing computer-based language training systems and language courses. Sandstone.fi Ltd., founded in 1997, provides language technology solutions for wireless media and for the Internet, such as electronic dictionaries that can be used on a mobile phone. Master's Innovations Ltd., founded in 1999, specialises in the development of computer-aided translation technologies and information retrieval systems.

Finally, there are a few companies which have made briefer visits to commercialising language technology, e.g. by carrying out only a single language engineering project. For instance, Republica, founded in 1996, has developed for KONE Elevators a product documentation process and tools based on controlled language.

No single path to success

The first companies all got off to relatively slow starts. Lingsoft in particular was, and remains to this day, a classic example of the pre-venture-capital era, financing its growth and development with its revenue alone. Kielikone did have external investors in SITRA and later on in TEKES, the National Technology Agency, but the sums involved in these relationships were far from the boisterous financing rounds witnessed in the IT industry a few years ago.

In its first years until 1995, Lingsoft undertook projects for, or licensed pieces of its technology to, the industry both in Finland and abroad. Until 1992, the company operated on a project basis and had no permanent, full-time employees, with annual sales hovering below 100 KEUR on average. The exception came in 1988, when the company licensed its Finnish hyphenation algorithm, and later on its Finnish spell-checker, to WordPerfect.

For its part, Lanser Data succeeded first in licensing a spell-checker for Finnish for incorporation in Lotus Corporation's AmiPro word processor in 1990, and reached an agreement concerning the same module with Microsoft in 1993.

Kielikone commercialised its technology for Finnish as a stand-alone spell-checker named MORFO and a grammar-checker named VIRKKU. Though VIRKKU was the first of its kind for any Nordic language, it was MORFO that brought in the majority of the company's revenues until the early 1990s. Kielikone also licensed its linguistic model of Finnish to Inso Corporation, which for its part adapted it as a spell-checker and then licensed it further to Microsoft in the mid-1990s.

After these mixed beginnings, Lingsoft and Kielikone gained pace and volume in their businesses by the mid-1990s. As a by-product of its major product development effort, the Finnish-to-English machine translation system, Kielikone had compiled a Finnish-to-English electronic dictionary. This resource turned into a success in 1992, when Käännöskone, Kielikone's subsidiary specialising in hand-held electronic dictionary devices, started distributing this dictionary content in its devices. Käännöskone was later to merge with the parent company.

Electronic dictionaries developed quickly under the brand name MOT into a full-scale business segment of Kielikone, accounting eventually for a major share of the company's sales. Little by little, Kielikone had either compiled itself or procured from respected major publishers rights to bilingual dictionary content from Finnish to key European languages and back, and in addition the company had set up an efficient productisation process of turning dictionary content into a finished end-user software product. Furthermore, Kielikone had put together an effective sales channel for distributing these products especially to the corporate market.

Lingsoft succeeded in 1995 in securing a significant localisation deal of Microsoft's Answer Wizard help system for a number of languages. Having gotten its foot in the door, Lingsoft managed in the late 1990s to establish itself as Microsoft's prime vendor for proofing tools for the Nordic languages and later German, starting with Finnish and then the other languages. Lingsoft's keys to success were the various resources for the Nordic languages and German that had been created during the project-based period of the company, starting with Koskenniemi's TWOL description for Finnish and the TWOL description for German, with which Lingsoft won the Morpholympics competition in 1994. Both linguistic descriptions, among many others, proved to be fully usable as the bases for the spell-checkers, hyphenators and thesauri that Lingsoft licensed to Microsoft - years after their original development.

A welcome late-bloomer for Lingsoft was the Constraint Grammar formalism and its implementations for various languages, which had not generated major sales for most of the 1990s, being mainly restricted to the academic market. In 1997, it was established that it was practical to use this formalism to develop a grammar checker for Swedish. After this was licensed to Microsoft in 1998, it became the first commercially available tool of its kind for Swedish. This product development process and licensing deal was then repeated for Finnish, Danish and Norwegian in 2000-2001.

It is interesting to note that neither outcome was a direct result of pre-determined strategy. Rather, both Kielikone and Lingsoft had happened to be in the right places with the right resources and technologies at the right times, and the companies had the wits to grab the opportunities.

The road can be bumpy at times

Towards the end of the 1990s, Lingsoft provided by far the largest set of languages among Microsoft's European subcontractors for proofing tools, with only one other subcontractor, MorphoLogic of Hungary, licensing more than a single language. As the number of languages that Lingsoft supported grew, so did its personnel and its sales, which tripled between 1995 and 1998 to just above 2 MEUR. Lingsoft realised early on that relying in the long run on a single customer would be a definite risk and that the company would eventually have to diversify its customer base and develop a more independent identity for its product range.

First, linguistic tools for information management, then speech technology, and finally electronic dictionary platforms were considered and developed mainly in-house. Lingsoft also marketed its proofing tools as stand-alone products, under the product names Orthografix and Grammatifix, and as customised versions for corporations. Lingsoft was certainly highly competent in creating linguistically appropriate software modules, and it had long experience of licensing and distributing these through other IT companies as integrated components in other software products. To sell end-user products directly, however, Lingsoft needed to set up a sales channel of its own, especially in order to reach the potentially lucrative corporate market. The company's resources in both personnel and financing were spread thin, and as a result the completion of the company's first genuinely own end-user products, Lingsoft Pointer and Lingsoft Parrot, dragged on until 2001. Coupled with a dramatic decrease in the general availability of venture capital, this forced Lingsoft to undergo a major belt-tightening and restructuring process, which showed signs of easing by the beginning of 2002, supported by a timely new licensing deal.

Kielikone, with the arrival in 1997 of a new managing director with a strong IT marketing background, seems to have succeeded in making the most of its extensive product line and strong market position in electronic dictionaries. In 1995-1999, Kielikone's sales doubled to approximately 2 MEUR, without a need to increase the company's personnel at anywhere near the same pace. Throughout the 1990s, Kielikone continued to develop its machine translation system, TranSmart, starting with Nokia and later with Rautaruukki Group, a Finnish steel manufacturing corporation, but the product has remained a showpiece of the company's language engineering competence and background rather than a major source of revenue. In fact, Harri Arnola is pursuing the development of more language-independent machine translation technology in his own company Ganesa.

In the later 1990s, Lanser Data succeeded in licensing its core technology to InXight, a subsidiary of Xerox Corporation. The company had, however, remained the smallest of the three older companies, and has now withdrawn from active operations.

New-comers are trying different paths

Conexor and Gurusoft are prime examples of companies adopting the new paradigm of IT business in Finland. Unlike the older language engineering companies, they have been exploiting external venture or private capital from the very beginning to develop their first products and to make their first approaches towards customers. Furthermore, both have been aiming directly at the international IT corporate market instead of starting little by little from Finland, followed by Sweden and Germany, then the rest of Europe, culminating in an attempt on the American market - an approach that used to be the norm for internationalisation for Finnish companies not so long ago.

In retrospect, one is in fact curious as to why the older Finnish language engineering companies in particular have continued to shy away from external private or venture capital, despite having proven technology bases already at their founding, whereas new-comers in the other Nordic countries, such as Nordisk Språkteknologi in Norway and Hapax in Sweden, have succeeded in securing millions of euros of venture capital from the very outset. Is it that these Finnish companies have not desired external investment and the accompanying external involvement and ambitious goals, even at a cost to their expansion, or is it that national investors have not seen or understood the potential in the strategies of these companies or in language technology in general?

Conexor aims to license its basic parsing technology under the Machinese brand name, in addition to marketing and licensing its own products Naviterm, a document indexing tool, and TrueStyler, a proofing tool, directly to international IT companies. What makes Conexor attractive on the international IT scene is that the company's core technological framework is relatively new and that the company has implemented it for the commercially most important Western languages, i.e. markets, namely English, French, Spanish and German. Finally, after a technology-driven start, Conexor has not hesitated to acquire IT marketing and managerial expertise to complement its existing linguistic competence. Conexor presently has some 100 customers worldwide, including e.g. Toshiba Corporation, and the company turned practically profitable in 2000.

Similarly, Gurusoft is aiming straight at the international market, where even a small slice can mean big money. Since the SOM technology does not necessarily need any rule-based linguistic preprocessing, Gurusoft's technology is in a sense truly language-independent, an obvious advantage compared to the other companies mentioned here, which have to build up the basic resources for every language they need at least semi-automatically, however language-independent the principles of their technology may be. On the other hand, Gurusoft has to face established international competitors such as Autonomy.

The obvious challenge for both is how tiny, hitherto unknown companies from a small Northern European country, Nokialand though it may be, will succeed with limited resources in getting a foot in the door just to make their sales pitches to their potential international customers.

Future opportunities and challenges

What, then, are the future prospects of these Finnish language engineering companies? All in all, they employed fewer than a hundred people at the end of 2001, and the annual sales of each company do not top a few million euros. In fact, rumour has it that Nokia has more people engaged in some form of language engineering research, specifically in speech technology, than these companies have altogether. Thus, for the time being, the Finnish language industry is still at most a budding high-technology business sector among others. If one analyses the situation of the individual companies, each faces unique challenges if it is to grow from its present situation.

Lingsoft's challenge is finding focus both productwise and languagewise. Lingsoft has a treasure trove of basic language engineering technology, resources and applications for the Nordic languages, the markets of which have unfortunately proven to be small and distinct, each requiring its own market entry strategy and investments as far as end-user products go. The company also has a similar breadth of linguistic technology for German, which holds great potential as the largest market in Europe, but which has also proven a hard nut to crack for Finnish IT companies. Lingsoft has promising product buds in speech technology and electronic dictionaries, but the company needs to work hard on getting its products to the market, i.e. building the sales channels. With Kielikone dominating the end-user market of electronic dictionaries and proofing tools in Finland, and other similar companies in the other Nordic countries, it seems that Lingsoft's best opportunities are in licensing its technology to internationally operating IT companies, corporate customers interested in customised versions of its existing products, and the German market in general.

Compared to the other language engineering companies in Finland, Lingsoft is unique in its involvement in speech technology, which is obviously an opportunity for the company, especially when combined with the comprehensive Nordic and German linguistic coverage that the others lack. In this product area, however, it faces potential competition from other actors such as Philips or Nordisk Språkteknologi in Norway.

Kielikone faces a challenge in internationalisation, unless the company is content to become entrenched within Finland. Kielikone is strong in productisation and management of the end-user market channel, with 2,500 corporate customers in Finland. In this area of strength, Kielikone faces equally strong, locally established companies in the other Nordic countries such as WordFinder Software in Sweden and Clue in Norway. In terms of linguistic technology, which could be an advantage outside Finland, Kielikone is limited to only Finnish and English. On the other hand, the company's own successful experiences in electronic dictionaries in Finland seem to indicate that language technology is still not an obligatory feature in that product group.

Despite several apparently satisfied corporate customers for its machine translation system, the number of organisations with an interest and the resources to adopt such a system requiring extensive tailoring for optimal usability is limited in a smallish market such as Finland, specifically as the system is presently restricted to translation from Finnish to English. A positive development is complementing the present MT system with the opposite direction in the presently on-going, EU-funded LINGMACHINE project (a part of the MLIS program), but it remains to be seen whether this leads to a major new source of income for the company.

Perhaps most significant is that Kielikone and Conexor have chosen to co-operate in this project rather than go their separate ways. In this respect, a somewhat delicate and potentially problematic issue, though most certainly not untypical for a small country, is still that a considerable number of the key owners and actors in these language engineering companies have also remained active as teachers or researchers in Finnish academia or as consulting experts in the same field. Both old and new players may have felt until now that it is difficult to try to start new language engineering projects without bumping into the existing - and possibly competing - actors in one role or another. The fact that the companies, starting with Kielikone, have opted for managing directors with backgrounds mainly in IT marketing and strategy, having less personal attachment to the academic side of language engineering, might help in relieving this.

For all these Finnish language engineering companies to achieve healthy growth, they need to find a way to break out of their niches and find solutions to their individual challenges. A major problem in providing language technological solutions for a number of languages is that one has to create and maintain linguistic models for each language that a company supports, an endeavour that can be rather costly especially in smaller markets such as Finland or even Sweden. In fact, Lingsoft, Lanser Data and Conexor have ended up developing similar basic linguistic models and tools for English, Swedish and Finnish, among the other languages they support. Furthermore, if one also counts in Kielikone, we end up having four different linguistic models for Finnish!

One has to wonder whether this is an ideal situation in the use of always scant resources from the view point of the Finnish language industry as a whole. Should these companies have focused or co-operated and complemented each other already a lot earlier than in the LINGMACHINE project? On the other hand, one could also speculate if it is rather language independent solutions such as Self-organising maps (SOM), which Gurusoft is promoting, that might have the largest potential for global growth, if that is what these companies seek? Should these Finnish companies be bolder than they have hitherto been regarding external investment and global growth? Only time will tell.

About the author

Antti Arppe is a researcher with a background in industrial management, knowledge engineering and linguistics. Since the early 1990s, he has worked for or with many Finnish language engineering companies, starting with conducting market research for Kielikone in 1993-1994 and later on as a product manager and deputy managing director for Lingsoft in 1994-1998. Since then, he has undertaken individual projects for Lingsoft, in which he is a minor shareholder. He also participated in the EAGLES project as a contributor in 1994. Presently he is a researcher in the nationally funded USIX/GILTA project and is pursuing a postgraduate degree in general linguistics at the University of Helsinki, where he also gives an annual introductory course in commercial language technology.

Acknowledgements:

The author is grateful for the comments, clarifications and additional information given by Harri Arnola (Kielikone, Ganesa), Olli Blåberg (Lanser Data), Lauri Carlson (University of Helsinki), Jaakko Happonen (Lingsoft), Mika Herpiö (Kielikone), Timo Honkela (Gurusoft), Kaarina Hyvönen (Kielikone), Kimmo Koskenniemi (University of Helsinki, Lingsoft), Seppo Koskenniemi (IBM), Katri Luostarinen (CSC), Jan Magnusson (Conexor), Harri Saarikoski (Republica), Pasi Tapanainen (Conexor), Juha Telkkinen (Promentor), Kristian Töyrä (TimeHouse), Atro Voutilainen (Conexor), Hanna Westerlund (University of Helsinki), and Graham Wilcock (University of Helsinki).