ERIS - Estonian Research and Development Information System

title:	The structure of multi-word verbs in Estonian texts
reg no:	ETF5787
project type:	Estonian Science Foundation research grant
subject:	2.9. System Engineering and Computer Technology; 6.3. Linguistics
status:	accepted
institution:	TU Faculty of Philosophy
head of project:	Heiki-Jaan Kaalep
duration:	01.01.2004 - 31.12.2007
description:	The project has two main goals. 1. Find the regularities that make it possible to recognize multi-word verbs in a text automatically. The regularities should be expressed formally, so that they can serve as the basis for a linguist's tool: a computer program that finds the base form of a multi-word expression, much as a morphological analyser finds the base form of a word-form. The program should also be able to tell whether a sentence contains the expression at all. 2. Using the program, find and tag the multi-word verbs in a 1-million-word text corpus. The list of candidate multi-word verbs will be derived from http://www.cl.ut.ee/ee/ressursid/pysiyhendid.html. To create and test the program, the multi-word expressions in a previously morphologically tagged text corpus of 200,000 tokens will be annotated manually. The project must also solve the problem of delimiting and precisely defining a verb-centred multi-word unit. Linguistically, the most interesting problem is therefore how to limit the set of noun-verb combinations: finding the features that define a prototypical multi-word verb, and describing what happens to those features as we move gradually from a multi-word verb towards a free combination. The process by which the noun becomes fixed within a particular noun-plus-verb combination represents a type of grammaticalization in Estonian. The program can be used on its own as a linguist's tool, as we plan to do when counting the frequencies of the multi-word expressions. More importantly, it would be a crucial component of any language processing tool, improving the quality of processing Estonian at every stage: morphological disambiguation, syntactic analysis and semantic disambiguation. The database of multi-word expressions, enriched with frequencies, can also be used in language teaching.
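
The task stated in goal 1 (decide whether a sentence contains a multi-word verb and, if so, return its base form) can be illustrated with a minimal Python sketch. It is not the project's program: the `Token` type, the `LEXICON` contents and the function `find_multiword_verbs` are hypothetical names introduced here, and the matching rule is a deliberate simplification that assumes the input has already been morphologically analysed, that the non-verbal component appears in a fixed form, and that the verb may occur anywhere in the same sentence.

```python
from typing import NamedTuple


class Token(NamedTuple):
    form: str   # word form as it appears in the text
    lemma: str  # base form, assumed to come from a morphological analyser


# Hypothetical mini-lexicon: verb lemma -> fixed (non-inflecting) components.
LEXICON = {
    "saama": ["aru"],  # "aru saama", 'to understand'
}


def find_multiword_verbs(sentence, lexicon=LEXICON):
    """Return base forms of the multi-word verbs found in one sentence.

    Simplifying assumption: the verb is matched by lemma, the fixed
    component by its surface form, and word order and distance are ignored.
    """
    forms = {tok.form.lower() for tok in sentence}
    found = []
    for tok in sentence:
        for fixed in lexicon.get(tok.lemma, []):
            if fixed in forms:
                found.append(f"{fixed} {tok.lemma}")
    return found


# "Ma sain sellest aru" ('I understood it'): the components are not adjacent.
sentence = [
    Token("Ma", "mina"),
    Token("sain", "saama"),
    Token("sellest", "see"),
    Token("aru", "aru"),
]
print(find_multiword_verbs(sentence))  # ['aru saama']
```

A rule this loose would also accept free combinations that merely share the same words, which is exactly the delimitation problem the description raises; a real tool would need additional constraints.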

project group
no	name	institution	position
1.	Heiki-Jaan Kaalep	TU Faculty of Philosophy	senior research fellow
2.	Kadri Muischnek	University of Tartu	research fellow