title: | The structure of multi-word verbs in Estonian texts |
---|---|
reg no: | ETF5787 |
project type: | Estonian Science Foundation research grant |
subject: | 2.9. System Engineering and Computer Technology; 6.3. Linguistics |
status: | accepted |
institution: | TU Faculty of Philosophy |
head of project: | Heiki-Jaan Kaalep |
duration: | 01.01.2004 - 31.12.2007 |
description: | The project has two main goals. 1. Find the regularities that make it possible to recognize multi-word verbs in a text automatically. The regularities should be expressed formally, so that they can be built into a tool for linguists: a computer program that finds the base form of a multi-word expression, much as a morphological analyser finds the base form of a word-form. The program should also be able to tell whether a sentence contains the expression at all. 2. Using the program, find and tag the multi-word verbs in a 1-million-word text corpus. The list of candidate multi-word verbs will be derived from http://www.cl.ut.ee/ee/ressursid/pysiyhendid.html. To create and test the program, the multi-word expressions in a previously morphologically tagged text corpus of 200,000 tokens will be annotated manually. The boundaries of a verb-centred multi-word unit have to be defined precisely during the project. Linguistically, the most interesting problem is therefore how to delimit the set of noun-verb combinations: finding the features that define a prototypical multi-word verb, and describing what happens to those features as we move gradually from a multi-word verb towards a free combination. The process by which the noun becomes fixed within a particular noun-plus-verb combination represents a type of grammaticalization in Estonian. The program can be used on its own as a linguist's tool, as we plan to use it for finding the frequencies of multi-word expressions. More importantly, it would be a crucial component of any language processing tool, improving the quality of processing Estonian at every stage: morphological disambiguation, syntactic analysis and semantic disambiguation. The database of multi-word expressions, enriched with frequencies, can also be used in language teaching. |
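
The first goal in the description (deciding whether a sentence contains a multi-word verb and returning its base form) can be illustrated with a minimal sketch. It is not the project's program: the `Token` type, the `"V"` tag, the `LEXICON` entries and the function `find_multiword_verbs` are all hypothetical names chosen for illustration, and the input is assumed to be already morphologically analysed.

```python
from typing import NamedTuple


class Token(NamedTuple):
    form: str   # surface word-form as it appears in the text
    lemma: str  # base form, assumed to come from a morphological analyser
    pos: str    # coarse part-of-speech tag (hypothetical tagset: "V" = verb)


# Hypothetical lexicon for illustration only:
# (fixed non-verbal component, verb lemma) -> base form of the expression.
LEXICON = {
    ("aru", "saama"): "aru saama",
    ("läbi", "viima"): "läbi viima",
}


def find_multiword_verbs(sentence: list[Token]) -> list[str]:
    """Return base forms of lexicon expressions whose parts co-occur in the sentence."""
    found = []
    for i, verb in enumerate(sentence):
        if verb.pos != "V":
            continue
        for j, other in enumerate(sentence):
            if i == j:
                continue
            # The non-verbal component is matched on its surface form,
            # the verb on its lemma, so inflected verb forms are covered.
            key = (other.form.lower(), verb.lemma)
            if key in LEXICON:
                found.append(LEXICON[key])
    return found


# "Ta sai sellest aru": the inflected verb "sai" and the distant "aru"
# are mapped back to the base form "aru saama".
example = [
    Token("Ta", "tema", "P"),
    Token("sai", "saama", "V"),
    Token("sellest", "see", "P"),
    Token("aru", "aru", "S"),
]
print(find_multiword_verbs(example))
```

Plain co-occurrence over-generates: it cannot tell a multi-word verb from a free combination of the same words, which is the delimitation problem the description identifies as the core of the project.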
project group | ||||
---|---|---|---|---|
no | name | institution | position | |
1. | Heiki-Jaan Kaalep | TU Faculty of Philosophy | senior research fellow | |
2. | Kadri Muischnek | University of Tartu | research fellow | |