title: The structure of multi-word verbs in Estonian texts
reg no: ETF5787
project type: Estonian Science Foundation research grant
subject: 2.9. System Engineering and Computer Technology
6.3. Linguistics
status: accepted
institution: TU Faculty of Philosophy
head of project: Heiki-Jaan Kaalep
duration: 01.01.2004 - 31.12.2007
description: There are two main goals.
1. Find the regularities that would make it possible to recognize multi-word verbs in a text automatically. The regularities should be expressed formally so that a tool for linguists can be built on them: a computer program that finds the base form of a multi-word expression, much as a morphological analyser finds the base form of a word-form. In addition, the program should be able to tell whether a sentence contains the expression at all.
2. Using the program, find multi-word verbs in a 1-million-word text corpus and tag them. The list of possible multi-word verbs will be derived from http://www.cl.ut.ee/ee/ressursid/pysiyhendid.html.
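As a rough illustration of the first goal, the sketch below looks up multi-word verbs in a sentence that has already been morphologically analysed. It is not the project's method: the two-entry lexicon, the (word form, lemma) input format and the function name find_multiword_verbs are assumptions made for the example.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (word form, lemma), assumed output of a morphological analyser

# Hypothetical mini-lexicon: base form -> (lemma of the non-verb part, lemma of the verb)
LEXICON = {
    "aru saama": ("aru", "saama"),
    "ette võtma": ("ette", "võtma"),
}


def find_multiword_verbs(sentence: List[Token]) -> List[Tuple[str, int, int]]:
    """Return (base form, index of non-verb part, index of verb) for each candidate."""
    lemmas = [lemma for _, lemma in sentence]
    matches = []
    for base, (other, verb) in LEXICON.items():
        # Naive test: both lemmas occur somewhere in the sentence, in any order.
        if other in lemmas and verb in lemmas:
            matches.append((base, lemmas.index(other), lemmas.index(verb)))
    return matches


# "Ta sai sellest aru": the parts of "aru saama" are separated and in reverse order.
sentence = [("Ta", "tema"), ("sai", "saama"), ("sellest", "see"), ("aru", "aru")]
print(find_multiword_verbs(sentence))  # [('aru saama', 3, 1)]
```

A lemma co-occurrence test like this reports the base form even when the parts are separated or reordered, but it cannot tell a multi-word verb from a free combination of the same words. Drawing that line is what the regularities sought in the project are meant to do.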
To create and test the program, the multi-word expressions in a previously morphologically tagged text corpus of 200,000 tokens will be annotated manually.
The project must also solve the problem of delimiting a verb-centred multi-word unit and defining its boundaries precisely. Linguistically, the most interesting question is how to limit the set of noun-verb combinations: finding the features that define a prototypical multi-word verb, and describing what happens to those features as we move gradually from a multi-word verb towards a free combination. The process by which the noun becomes fixed within a particular noun-plus-verb combination represents a type of grammaticalization in Estonian.
The computer program can be used independently as a linguist's tool, just as we plan to use it to find the frequencies of the multi-word expressions. More importantly, the program could be a crucial component of any language processing tool, improving the quality of Estonian processing at every stage: morphological disambiguation, syntactic analysis and semantic disambiguation.
The database of multi-word expressions, enriched with frequencies, can also be used in language teaching.

project group
no  name               institution               position
1.  Heiki-Jaan Kaalep  TU Faculty of Philosophy  senior research fellow
2.  Kadri Muischnek    University of Tartu       researcher