Bartz/etal/2015a: Using Data Mining and the CLARIN Infrastructure to Extend Corpus-based Linguistic Research

Bibtype Incollection
Bibkey Bartz/etal/2015a
Author Bartz, Thomas; Pölitz, Christian; Morik, Katharina and Storrer, Angelika
Ls8autor Bartz, Thomas; Morik, Katharina; Pölitz, Christian
Editor Odijk, Jan
Title Using Data Mining and the CLARIN Infrastructure to Extend Corpus-based Linguistic Research
Booktitle Selected Papers from the CLARIN 2014 Conference, October 24-25, 2014, Soesterberg, The Netherlands
Pages 1-13
Address Linköping
Publisher Linköping University Electronic Press
Abstract Large digital corpora of written language, such as those held by the CLARIN-D centers, provide excellent possibilities for linguistic research on authentic language data. Nonetheless, the large number of hits that can be retrieved from corpora often leads to challenges in concrete linguistic research settings. This is particularly the case if the queried word forms or constructions are (semantically) ambiguous. The joint project ‘Corpus-based Linguistic Research and Analysis Using Data Mining’ (“Korpus-basierte linguistische Recherche und Analyse mit Hilfe von Data-Mining” – ‘KobRA’) is therefore investigating the benefits and issues of using machine learning technologies to perform post-retrieval cleaning and disambiguation tasks automatically. The following article gives an overview of the questions, methodologies and current results of the project, specifically in the scope of corpus-based lexicography and historical semantics. In this area, topic models were used to partition KWIC lists of search results, retrieved by querying various corpora for polysemous or homonymous words, according to the individual meanings of these words.
Year 2015
Project KobRA
Url http://www.ep.liu.se/ecp_article/index.en.aspx?issue=116;article=001
