Homepage of R. Harald Baayen
Research interests
I am interested in words: their internal structure, their meaning, their distributional properties, how they are used in different speech communities and registers, and how they are processed in language comprehension and speech production. The following links provide some more information.
morphological productivity
The number of words that can be described with a word formation rule varies substantially. For instance, in English, the number of words that end in the suffix -th (e.g., warmth) is quite small, whereas there are thousands of words ending in the suffix -ness (e.g., goodness). The term 'morphological productivity' is generally used informally to refer to the number of words in use in a language community that a rule describes. For a proper understanding of the intriguing phenomenon of morphological productivity, I believe it is crucial to distinguish between (a) language-internal, structural factors, (b) processing factors, and (c) social and stylistic factors. Formal linguists tend to focus on (a), psychologists on (b), and neither likes to think about (c). Sociologists and anthropologists would probably only be interested in (c). In order to make the rather fuzzy notion of quantity that is part of the concept of morphological productivity more precise, I have developed several statistical measures based on conditional probabilities for assessing productivity (Baayen 1991, 1992, Yearbook of Morphology). These measures assess the outcome of all three kinds of factors mentioned above, and provide an objective starting point for interpretation, given the kind of materials sampled in the corpora from which they are calculated. Baayen and Renouf (1996, Language) show how these measures shed light on the role of structural factors. Hay (2003, Causes and Consequences of Word Structure, Routledge) recently documented the importance of phonological processing factors, which are addressed for a large sample of English affixes in Hay and Baayen (2002, Yearbook of Morphology; 2003, Italian Journal of Linguistics); see also Hay and Plag (2004, Natural Language and Linguistic Theory) and Krott, Schreuder and Baayen (1999, Linguistics).
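As a rough illustration of how a corpus-based productivity measure can be computed, the Python sketch below estimates one commonly used hapax-based formulation: the number of words with a given affix that occur exactly once (V1), divided by the total number of tokens with that affix (N). The function name, the toy word list, and the crude string-suffix matching are assumptions made purely for illustration; a real study would use a large, morphologically parsed corpus.

```python
from collections import Counter


def productivity(tokens, suffix):
    """Return (V, N, V1, P) for the words in `tokens` ending in `suffix`.

    V  = number of distinct word types with the suffix
    N  = number of tokens with the suffix
    V1 = number of types occurring exactly once (hapax legomena)
    P  = V1 / N, a hapax-based estimate of productivity

    Matching on the final letters is a crude stand-in (an assumption of
    this sketch) for a proper morphological analysis.
    """
    counts = Counter(t for t in tokens if t.endswith(suffix))
    n_tokens = sum(counts.values())
    n_types = len(counts)
    n_hapax = sum(1 for c in counts.values() if c == 1)
    p = n_hapax / n_tokens if n_tokens else 0.0
    return n_types, n_tokens, n_hapax, p


# Hypothetical toy sample, not real corpus counts.
toy_corpus = (
    ["warmth"] * 6 + ["length"] * 5 + ["depth"] * 4
    + ["goodness"] * 3 + ["kindness"] * 2
    + ["awkwardness", "bluntness", "crispness"]
)

for suffix in ("th", "ness"):
    v, n, v1, p = productivity(toy_corpus, suffix)
    print(f"-{suffix}: V={v} N={n} V1={v1} P={p:.3f}")
```

In this toy sample the -th words are all frequent and none occurs only once, so P is zero, whereas the -ness words include several hapax legomena and receive a higher score. This mirrors the intuition that a productive rule keeps supplying rare, newly formed words.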
It is my impression that social and stylistic factors are at least as important as the structural and processing factors (Baayen, 1994, JQL; Plag, Dalton-Puffer and Baayen, 1999, English Language and Linguistics; Polman and Baayen, 2001, Computers and the Humanities; Baayen and Neijt, 1997, Linguistics) for predicting to what extent word formation patterns are actually used in speech communities. A recent study linking the distributional and experimental approaches is Plag, I. and Baayen, R. H. (2009). Suffix ordering and morphological processing. Language, 85, 106-149.
morphological processing
There are two strands in my work on morphological processing. The first line of research uses frequency effects as a tool to trace the extent to which our memory for words, known in this domain of research as the mental lexicon, registers word-specific information. In formal linguistics, the lexicon is generally viewed as the repository of the unpredictable, which leads to the prediction that fully predictable complex words, whether derived or inflected, would not leave traces in lexical memory. The evidence that has accumulated in my lab over the years shows that this prediction is contradicted by ubiquitous frequency effects both in visual (see, e.g., Baayen, Dijkstra, and Schreuder, 1997, Journal of Memory and Language; Baayen, Schreuder, De Jong, and Krott, 2002, edited volume) and auditory comprehension (Baayen, McQueen, Dijkstra, and Schreuder, 2003, edited volume), as well as in production (Baayen, Levelt, Schreuder, and Ernestus, 2008). However, the extent of storage decreases when, due to rich morphology, the instance space becomes sparse and exemplars are represented only weakly or fade from memory (Bertram, Laine, Baayen, Schreuder, and Hyona, 1999, Cognition).
Lexical memory also turns out to be highly sensitive to the fine details of the acoustic signal and the probability distributions of these details among the exemplars in the mental lexicon (Kemps, Ernestus, Schreuder, and Baayen, Memory and Cognition, 2005). Work by Mark Pluymaekers, Mirjam Ernestus, and Victor Kuperman indicates, furthermore, that the degree of affix reduction and assimilation in complex words correlates with frequency of use, which suggests that a word's specific frequency co-determines the fine details of its phonetic realization. See, e.g., Kuperman, Pluymaekers, Ernestus, and Baayen (2006), and Pluymaekers, Ernestus, and Baayen (2005). This array of findings is problematic for strictly decompositional models of morphological processing, and argues for memory-based morphology.
My second line of research addresses the nature of morphological rules. In traditional formal linguistics, morphological rules are context-sensitive symbolic operations that exist independently of the repository of basic and irreducible formatives in the lexicon. These rules 'fill in' whatever is predictable, and are supposed to be used on-line in comprehension and production. Within exemplar-based approaches, in which the lexicon is viewed as a highly redundant, exquisite memory, it becomes possible to view rules as generalizations over stored exemplars. The general idea is unfolded in several seminal studies by Bybee (e.g., her 1985 and 2001 books). However, the challenge for memory-based morphology is to formalize exemplar-based theory sufficiently to allow for concrete predictions that can be tested experimentally. One possibility that is pursued vigorously in cognitive science is to make use of the technology of artificial neural networks (ANNs, see the seminal study of Rumelhart and McClelland, 1986, and subsequent studies). ANNs have many advantages.
They are powerful statistical pattern associators, they implement a form of data compression, they can be lesioned for modeling language deficits, they are not restricted to discrete (or discretized) input, etc. On the downside, it is unclear to what extent current connectionist models can handle the symbolic aspects of language (see, e.g., Levelt, 1991, Sprache und Kognition), and the performance of ANNs is often not straightforwardly interpretable. Although the metaphor of neural networks is appealing, the connectionist models that I am aware of make use of training algorithms that are mathematically attractive but biologically implausible. In particular, the evidence advanced by Ullman and colleagues for distinct procedural versus declarative systems in the brain challenges simple, undifferentiated network models in which rules and representations are fully merged.
In our lab, we have therefore mainly explored other statistical techniques that maintain the distinction between rules and representations, that provide more immediate insight into what lexical properties drive morphological generalizations, that stay close to general, well-validated techniques in statistics and machine learning for data analysis, and that therefore provide some protection against piecemeal data (over)fitting. Techniques that have been especially useful for us are Skousen's AML, Daelemans' TiMBL (an exemplar-based system with impressive data compression), as well as Classification and Regression Trees (Breiman et al., 1984). We have used these techniques successfully to model rule-less, paradigmatically governed aspects of Dutch morphology (interfixes, Krott, Baayen and Schreuder, 2001, Linguistics; regular past tense formation, Ernestus and Baayen, 2003, Language; 2004, Linguistics; incomplete neutralization, Ernestus and Baayen, in press, Papers in Laboratory Phonology, Ernestus and Baayen, 2006).
For a study using a connectionist network with performance similar to that of AML and TiMBL, see Moscoso del Prado Martin, Ernestus, and Baayen, 2004, Brain and Language. The importance of paradigmatic relations for lexical processing has also become evident from our work on the morphological family size effect (Bertram, Baayen and Schreuder, 2000, Journal of Memory and Language; De Jong, Schreuder and Baayen, 2000, Language and Cognitive Processes; Moscoso del Prado Martin, Kostic, and Baayen, 2004, Cognition; for auditory comprehension, see Wurm, Ernestus, Schreuder and Baayen). For work addressing the imbalance of semantic interconnectivity for regular and irregular verbs and its consequences for lexical processing, see Baayen and Moscoso del Prado Martin (2004) and Tabak, Schreuder, and Baayen (2004). A connectionist model for visual lexical decision that captures family size effects along with many other lexical variables is described in the doctoral dissertation of Moscoso del Prado Martin, available in pdf from his homepage. Interestingly, work on compound interpretation by Cristina Gagne and work by Ingo Plag and collaborators on stress assignment in English compounds suggest that local analogical generalization is widespread across different areas of morphology.
Our approach to morphological structure and its consequences for lexical
processing is based on the conviction that the mental lexicon is a highly
sensitive memory system combined with an equally sensitive system for
memory-based probabilistic generalization (for a broader scope, see Bod, Hay,
and Jannedy, Probabilistic Linguistics, 2003, MIT Press). The
insight that probability is a crucial aspect of linguistic cognition is one we
share with both connectionist theories and with stochastic optimality theory
(SOT, Boersma and Hayes, 2001, Linguistic Inquiry). While the surprising level
of item-specific detail registered in lexical memory is fully compatible with
connectionist approaches, it is not easily incorporated in SOT in an
insightful way. SOT's gradual learning algorithm extracts information from
individual items encountered during learning in order to weight higher-order
generalizations (constraints), explicitly without storing the details
characterizing these individual items. The high degree to which fine
word-specific details are now known to leave their traces in lexical memory could perhaps be
accounted for by adding more and more word-specific constraints to the theory,
but it is unclear to us how then to avoid an uninsightful combinatorial explosion of
word-specific constraints. Although greedy learning (economizing on storage)
and lazy learning (using as many exemplars as possible) sometimes lead to computationally
equivalent theories (see, e.g., the study by Keuleers and Sandra for a comparison with the
Memory Based Learner of Albright and Hayes, Cognition, 2003), the accumulating evidence for
exemplars of regular inflected words (see, e.g., the book chapter by Milin et al. and recent
work by Kuperman) suggests that the role of memory in language may well have been seriously
underestimated.
A summary of my general approach to these issues is Baayen, R.H. (2007). Storage and computation in the mental lexicon. In G. Jarema and G. Libben (Eds.), The Mental Lexicon: Core Perspectives, Elsevier, 81-104.
My interest in stylometry was sparked by the seminal work of Burrows, who documented stylistic, regional, and authorial differences by means of principal components analyses applied to the relative frequencies of the highest-frequency function words in literary texts. Baayen (1994, JQL) is a study exploring the potential of a productivity measure for distinguishing different text types. Baayen, Tweedie and Van Halteren (1996, Literary and Linguistic Computing) compared function words with syntactic tags in an authorship identification study. Baayen, Van Halteren, Neijt and Tweedie (2002, Proceedings JADT) and Van Halteren, Baayen, Tweedie, Haverkort, and Neijt (JQL, 2005) showed that in a controlled experiment, the writings of 8 students of Dutch language and literature at the University of Nijmegen could be correctly attributed to a remarkable degree (80% to 95% correct attributions). For a study on the socio-stylistic stratification of the Dutch suffix -lijk in speech and writing, see Keune et al. (2006).
statistics, exploratory data analysis and visualization
My book on word frequency distributions (Kluwer, 2001) gives an overview of statistical models for word frequency distributions. Recently, Evert developed the LNRE extension of the Zipf-Mandelbrot model (Evert, 2004, JADT proceedings). I recommend R as an exquisite open-source tool for data analysis. My book with Cambridge University Press, Analyzing Linguistic Data: A practical introduction to Statistics using R, provides an introduction to R for linguists and psycholinguists. An older version of this book, with various typos and small errors, is available (see my list of publications for a link). If you make a hardcopy you should commit yourself to buying the final printed version. The data sets and convenience functions used in this textbook are available in the 'languageR' package, which can be obtained from the CRAN archives. A webpage that is worth visiting provides great examples of trellis graphics.
Mixed effects modeling is a beautiful technique for understanding the structure of data sets with both fixed and random effects as typically obtained in (psycho)linguistic experiments. My favorite reference for nested random effects is Pinheiro and Bates (2000). For data sets with crossed fixed and random effects, the recent (revolutionary) lme4 package of Douglas Bates provides exactly the tools we need. A paper co-authored with Doug Davidson and Douglas Bates in JML and chapter 7 of my introductory textbook provide various examples of its use. Bates' lmer function can also be used for generalized linear mixed-effects models. A highly recommended more technical book on mixed effects modeling that provides detailed examples of crossed random effects for subjects and items is Julian J. Faraway (2006), Extending Linear Models with R: Generalized Linear, Mixed Effects and Nonparametric Regression Models, Chapman and Hall.
useful links for (psycho)linguists
For an efficient, stable and virus-free operating system which provides a cutting edge for research, there is Linux and the Ubuntu distribution. For scripting, there is Python. For statistical analysis, there is R. For speech analysis, there is Praat. For collocation software, there is Evert's UCS toolkit. For the Celex lexical database, see the Linguistic Data Consortium.